-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zebra: Temporarily block the execution of the rib_process function while the thread t_dplane is waiting to be scheduled. #15065
Conversation
3f82ff6
to
3b8c4f1
Compare
related issue:#15016 |
3b8c4f1
to
a4680eb
Compare
a4680eb
to
f214941
Compare
…ile the thread t_dplane is waiting to be scheduled. This allows t_dplane to prioritize freeing up the cache structures of zebra_dplane_ctx. This addresses the issue where a large number of zebra_dplane_ctx nodes are cached on the rib_dplane_q when a surge of routes is inserted in a short period of time, leading to the consumption of a significant amount of temporary memory and potentially causing system memory overload and kernel OOM (Out of Memory) problems. Signed-off-by: hanyu.zly <[email protected]>
f214941
to
64aa037
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's good to be thinking about how to do rate-matching or backpressure within the system, but I don't think this is really the solution. I think this needs more discussion.
@@ -4974,6 +4979,7 @@ static void rib_process_dplane_results(struct event *thread) | |||
} | |||
|
|||
} while (1); | |||
t_dplane = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we can't just poke at these event pointers like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to clarify. Just setting the pointer to NULL doesn't actually stop the event from running. FRR has just lost all control to it at all. This is especially bad because if the event in question has pointers stored in it and then those pointers are freed we'll have use after free's because the rest of FRR assumes that the event pointer as NULL the event is not running. We've had too many crashes to allow this to get into the system again.
Yeah this is a decent idea, but the execution is all wrong. Not doing any work while the dplane has initiated a call back with the results is not the way I want to proceed with this. The problem is that the dplane_fpm_nl module is holding onto the contexts when the underlying process that is being communicated too is not responding in a fast enough fashion. The underlying dplane should signal that it is backed up via a new dplane context command call and signal when it is no longer backed up as well. Then the master pthread can then decide to initiate more work or not. Additionally the dplane can decide at what context count should trigger the please hold off request and at what context count it should tell the upper level to start working again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've outlined a different approach that would be acceptable. In the current state this approach is dead in the water. There might also need to be work to have the workqueue be smarter about when to come back to do more work here as well instead of busy looping
Is it correct that the real source of this specific problem is fpm - that the other end of the fpm session is just ... slow to read? if that's true, then that really needs to get fixed first (imo). |
I understand your perspective, but I think there might be some differences in our understanding of this section of the code. |
Actually, the reason isn't that the FPM reads data slowly. Please refer to my previous reply, thank you. |
so there are (at least) a couple of things going on here. there's the sort of practical, low-level issue of how to communicate between pthreads - whether it is viable, workable to use an event* to try to do that. but the real issue is what to do when there's a rate mismatch within zebra - what components can detect a mismatch, what components should react to it, and what that reaction should be. it's not viable to play with the event* outside the library lock, full-stop. that value is set by the dplane pthread, under a lock, and is cleared/NULLed by the event library in the context of the main pthread, under a lock. every time you look at that value you introduce a race, and any time you touch it you introduce a race and you risk leaking memory (a small amount, it's true). About the fpm plugin - it does look inefficient to me, just on casual inspection, but I didn't really follow this statement:
the actual netlink code is making a system call to send messages to the kernel, then it's waiting for a reply with status for the requests that were in the netlink batch. that's not a cheap operation. the fpm code only has to send on a socket, and if that's slower than the kernel operation, that's something we should be able to fix. the fpm code takes a lot of locks, instead of dequeuing batches. the fpm code doesn't ... just get on with its work, taking a batch of contexts and sending them out - it schedules an event to itself to look at a queue, and then it puts bytes into a stream buffer, and then it schedules itself to actually write the stream buffer to the socket. it looks like that could be cleaned up, and that might improve the throughput in that plugin. the higher-level question seems to be: if one dplane plugin is running behind, running slowly, what should happen? right now, the dplane only really checks its own top-level queue, the initial queue that receives contexts from the main pthread. it could check the providers' queues, and try using those as a proxy for the state of the overall system. but that won't be able to detect queue buildup inside a provider's own data structures. or the providers could have a way to signal to the dplane framework that they are building queue - and the dplane framework could use that signal to slow down the rate at which it accepts new incoming work. you could make the case that the overall system shouldn't overrun the slowest component; but neither of those things exists right now though. where do you think the bottleneck is? do you think it's in the less-than-perfect fpm plugin? or do you think it's in the zebra main pthread, trying to work through a big impulse of changes? |
This makes no sense to me. The very fact that this solution was raised as fixing the problem clearly implies that blocking the processing of new routes while the master zebra pthread has contexts to handle slows the master pthread down enough to allow the dplane pthread to clear it's queues. This also implicitly implies that the contexts are sitting in the rib_dplane_q. Which is a bit odd. What queue exactly are the contexts sitting in?
This sure sounds like it's needed as a seperate thing to fix in it's own right.
This also sounds like a clearer path of fixing that needs to be done and frankly sounds like a higher value problem that should be addressed first.
This doesn't make a whole bunch of sense to me. In my local testing I cannot make the context queue grow when using dplane_fpm_nl. I am also not using fpmsyncd as a listener. I do know that I can make it grow arbitrarily when I intentionally introduce a delay in responding. Perhaps we don't fully understand where the problem is at this point in time. |
This PR is stale because it has been open 180 days with no activity. Comment or remove the |
This allows t_dplane to prioritize freeing up the cache structures of zebra_dplane_ctx. This addresses the issue where a large number of zebra_dplane_ctx nodes are cached on the rib_dplane_q when a surge of routes is inserted in a short period of time, leading to the consumption of a significant amount of temporary memory and potentially causing system memory overload and kernel OOM (Out of Memory) problems.